It’s difficult to overstate the extent to which the COVID-19 pandemic has tested the world’s public health infrastructure over the past two years. At the same time, the COVID-19 “experience” has manifested unequally – not just country to country, but even city to city. As one of the most heterogeneous urban areas in the world, New York City provides a fascinating case study into the ways in which socioeconomic status may be associated with, or even mediate, disparities in health outcomes. (For instance, it’s already well-documented that income and race, along with socioeconomic privilege and political ideology, drive inequities in COVID vaccination rate across US cities.)
Knowing that socioeconomic factors have historically been associated with health outcomes, we aim to examine relationships between a range of predictors (e.g. race/ethnicity, education, broadband internet access, household income / occupational income score, public vs. private health insurance) and COVID-19 health outcomes – namely, hospitalizations, deaths, and vaccinations.
Our initial questions are centered on how demographic factors in New York City trend with COVID-19 outcomes. Particularly, what are the effects of predictors such as race/ethnicity or education on COVID-19 hospitalizations or vaccinations? Are there differences in COVID-19 hospitalizations or vaccinations when considering public vs. private health insurance? And are there differences in COVID-19 outcomes based on household income? Our aim is to examine the relationships between these predictors and outcomes in specified geographic areas across New York City.
The Integrated Public Use Microdata Series (IPUMS USA) consists of individual-level data from samples of the US population drawn from the American Community Surveys (ACS) of 2000 - present as well as from fifteen federal censuses from 1850 to 2010. Each record in IPUMS is a person with numerically coded characteristics and “weight” variables indicating how many persons in the population are represented by each record. Samples were created at different times by different investigators, which lead to a variety of documentation conventions. However, IPUMS applies consistent coding and documentation across records to allow for effective analysis over time. A data extraction system exists to allow users to pull particular samples and variables from IPUMS. This project uses demographic and macroeconomic data from the American Community Survey (ACS) 2019 five-year estimate via IPUMS.
While data from IPUMS is recorded at the individual-level, each interview is coded to a particular Public Use Microdata Area (PUMA) geography where the housing unit was located at the time of interview.
The New York City Department of Health and Mental Hygiene (NYC DOHMH) is one of the oldest public health agencies in the United States. Among other responsibilities, the DOHMH monitors the spread of infectious disease in NYC. The Department of Health classified the beginning of the COVID-19 pandemic as February 29, 2020, the date of the first laboratory-confirmed case. Since then, the DOHMH has recorded and reported COVID-19 data on a daily, weekly, or monthly basis. These data include cases, hospitalizations, and deaths by borough, modified Zip Code tabulation area (ZCTA), and demographic factors. As NYC has administered vaccinations for COVID-19, these data have been recorded and made available by borough, ZCTA, and demography. This project uses COVID-19 hospitalization rates, death rates, and vaccination rates by ZCTA in NYC.
The Baruch Geoportal, maintained by the Newman Library at Baruch College, is a repository of geospatial resources including tabular data sets, tutorials, maps, and crosswalks. This project uses a crosswalk data set from Baruch Geoportal to apportion NYC ZCTAs to PUMAs, which allows PUMA-coded data from IPUMS to be analyzed alongside COVID-19 outcome data from DOHMH.
As mentioned above, monthly outcome data were obtained from the NYC DOHMH with data reported at the ZCTA-level geography. First, these data were summed over the time interval March 2020 - ### to obtain one cumulative incidence measure per ZCTA. However, the predictor variables from IPUMS were coded to the PUMA-level geography. The first step in the cleaning process was to convert these data into a common geography to facilitate further analysis. The Baruch ZCTA-PUMA crosswalk data set was used in this data conversion. Below is a brief excerpt:
# Read in ZCTA/PUMA crosswalk
read_csv("./data/zcta_puma_cross.csv") %>%
head() %>%
knitr::kable()
| zcta10 | stateco | alloc | puma10 | pumaname | pop2010 | per_in_puma | per_of_puma |
|---|---|---|---|---|---|---|---|
| 10451 | 36005 | x | 3710 | NYC-Bronx Community District 1 & 2–Hunts Point, Longwood & Melrose | 22027 | 0.482 | 0.141 |
| 10451 | 36005 | NA | 3708 | NYC-Bronx Community District 4–Concourse, Highbridge & Mount Eden | 21002 | 0.459 | 0.151 |
| 10451 | 36005 | NA | 3705 | NYC-Bronx Community District 3 & 6–Belmont, Crotona Park East & East Tremont | 2684 | 0.059 | 0.017 |
| 10452 | 36005 | x | 3708 | NYC-Bronx Community District 4–Concourse, Highbridge & Mount Eden | 71729 | 0.952 | 0.515 |
| 10452 | 36005 | NA | 3707 | NYC-Bronx Community District 5–Morris Heights, Fordham South & Mount Hope | 3642 | 0.048 | 0.027 |
| 10453 | 36005 | x | 3707 | NYC-Bronx Community District 5–Morris Heights, Fordham South & Mount Hope | 77074 | 0.984 | 0.574 |
The following columns were used to convert ZCTA-level outcome data to PUMA-level data:
‘zcta10’ - ZCTA unique identifier
‘puma10’ - PUMA unique identifier
‘per_in_puma’ - percentage of the specified ZCTA that is located within the specified PUMA
‘per_of_puma’ - percentage of the specified PUMA that is occupied by the specified ZCTA
The following shows the mathematical expression used to convert from ZCTA-level outcome data to PUMA-level outcome data:
\[ \sum_{i = 1}^{n} \text{zcta_outcome_data}_{i} \cdot \text{per_in_puma}_{i} \cdot \text{per_of_puma}_{i} = \text{puma_outcome_data} \]
This resulted in one cumulative incidence measure per PUMA (per 100,000 people for hospitalizations and deaths, a percentage for vaccinations).
The demographic dataset from IPUMS originally contained over 350,000 interviews from the NYC area, each coded to a PUMA and given a ‘perwt’ - person weight and ‘hhwt’- household weight. First, predictor variables were renamed and recoded according to the data dictionary (link to data dictionary) such that the following predictor variables were obtained for each interview along with ‘borough’ and ‘puma’ coding:
# Cleaning
jimzip <- function(csv_file, path) {
# create full path to csv file
full_csv <- paste0(path, "/", csv_file)
# append ".zip" to csv file
zip_file <- paste0(full_csv, ".zip")
# unzip file
unzip(zip_file)
# read csv
data_extract <- read_csv(csv_file)
# be sure to remove file once unzipped (it will live in working directory)
on.exit(file.remove(csv_file))
# output data
data_extract
}
# Apply function to filtered census data CSV
census_data <- jimzip("census_filtered.csv", "./data")
# Merging the Outcome Data
# Read in PUMA outcomes data
health_data <-
read_csv("./data/outcome_puma.csv")
# Merge census data with PUMA outcomes data
merged_data <- merge(census_data, health_data, by = "puma")
# Deprecate census data alone
rm(census_data)
## Cleaning the Data
# Clean the merged census and outcomes data
# Each row represents one
cleaned_data =
merged_data %>%
# Remove variables less useful for analysis or redundant (high probability of collinearity with remaining variables)
select(-serial, -cluster, -strata, -multyear, -ancestr1, -ancestr2, -labforce, -occ, -ind, -incwage, -occscore, -pwpuma00, -ftotinc, -hcovpub) %>%
# Remove duplicate rows, if any
distinct() %>%
# Rename variables
rename(
borough = countyfip,
has_broadband = cihispeed,
birthplace = bpl,
education = educd,
employment = empstat,
personal_income = inctot,
work_transport = tranwork,
household_income = hhincome,
on_foodstamps = foodstmp,
family_size = famsize,
num_children = nchild,
US_citizen = citizen,
puma_vacc_rate = puma_vacc_per,
on_welfare = incwelfr,
poverty_threshold = poverty
) %>%
# Recode variables according to data dictionary
mutate(
# Researched mapping for county
borough = recode(
borough,
"5" = "Bronx",
"47" = "Brooklyn",
"61" = "Manhattan",
"81" = "Queens",
"85" = "Staten Island"
),
rent = ifelse(
rent == 9999, 0,
rent
),
household_income = ifelse(
household_income %in% c(9999998,9999999), NA,
household_income
),
on_foodstamps = recode(
on_foodstamps,
"1" = "No",
"2" = "Yes"
),
has_broadband = case_when(
has_broadband == "20" ~ "No",
has_broadband != "20" ~ "Yes"
),
sex = recode(
sex,
"1" = "Male",
"2" = "Female"
),
# Collapse Hispanic observation into race observation
race = case_when(
race == "1" ~ "White",
race == "2" ~ "Black",
race == "3" ~ "American Indian",
race %in% c(4,5,6) ~ "Asian and Pacific Islander",
race == 7 & hispan %in% c(1,2,3,4) ~ "Hispanic",
race == 7 & hispan %in% c(0,9) ~ "Other",
race %in% c(8,9) ~ "2+ races"
),
birthplace = case_when(
birthplace %in% 1:120 ~"US",
birthplace %in% 121:950 ~ "Non-US",
birthplace == 999 ~"Unknown"
),
US_citizen = case_when(
US_citizen %in% c(1,2) ~ "Yes",
US_citizen %in% 3:8 ~"No",
US_citizen %in% c(0,9) ~ "Unknown"
),
# Chose languages based on highest frequency observed
language = case_when(
language == "1" ~ "English",
language == "12" ~ "Spanish",
language == "43" ~ "Chinese",
language == "0" ~ "Unknown",
language == "31" ~ "Hindi",
!language %in% c(1,12,43,0,31) ~ "Other"
),
# Collapse multiple health insurance variables into single variable
health_insurance = case_when(
hcovany == 1 ~ "None",
hcovany == 2 && hcovpriv == 2 ~ "Private",
hcovany == 2 && hcovpriv == 1 ~ "Public"
),
education = case_when(
education %in% 2:61 ~ "Less Than HS Graduate",
education %in% 62:64 ~ "HS Graduate",
education %in% 65:100 ~ "Some College",
education %in% 110:113 ~ "Some College",
education == 101 ~ "Bachelor's Degree",
education %in% 114:116 ~ "Post-Graduate Degree",
education %in% c(0,1,999) ~ "Unknown"
),
employment = case_when(
employment %in% c(0,3) ~ "Not in labor force",
employment == 1 ~ "Employed",
employment == 2 ~ "Unemployed"
),
personal_income = ifelse(
personal_income %in% c(9999998,9999999), NA,
personal_income
),
household_income = ifelse(
household_income %in% c(9999998,9999999), NA,
household_income
),
on_welfare = case_when(
on_welfare > 0 ~ "Yes",
on_welfare == 0 ~ "No"
),
poverty_threshold = case_when(
poverty_threshold >= 100 ~ "Above",
poverty_threshold < 100 ~ "Below"
),
work_transport = case_when(
work_transport %in% c(31:37, 39) ~ "Public Transit",
work_transport %in% c(10:20, 38) ~ "Private Vehicle",
work_transport == 50 ~ "Bicycle",
work_transport == 60 ~ "Walking",
work_transport == 80 ~ "Worked From Home",
work_transport %in% c(0, 70) ~ "Other"
)
) %>%
# Eliminate columns no longer needed after transformation
select(-hispan, -hcovany, -hcovpriv) %>%
# Relocate new columns
relocate(health_insurance, .before = personal_income) %>%
relocate(poverty_threshold, .before = work_transport) %>%
relocate(on_welfare, .before = poverty_threshold) %>%
relocate(perwt, .before = hhwt) %>%
# Create factor variables where applicable
mutate(across(.cols = c(puma, borough, on_foodstamps, has_broadband, sex, race, birthplace, US_citizen, language, health_insurance, education, employment, on_welfare, poverty_threshold, work_transport), as.factor)) %>%
# Ensure consistent use of percentages
mutate(
puma_death_rate = puma_death_rate / 100,
puma_hosp_rate = puma_hosp_rate / 100
)
# View first few rows
cleaned_data %>%
head() %>%
knitr::kable()
| puma | perwt | hhwt | borough | rent | household_income | on_foodstamps | has_broadband | family_size | num_children | sex | age | race | birthplace | US_citizen | language | education | employment | health_insurance | personal_income | on_welfare | poverty_threshold | work_transport | puma_death_rate | puma_hosp_rate | puma_vacc_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3701 | 18 | 25 | Bronx | 0 | 166870 | No | No | 2 | 0 | Male | 58 | White | Non-US | Yes | English | Post-Graduate Degree | Employed | Private | 83435 | No | Above | Public Transit | 3.982366 | 10.64624 | 55.79213 |
| 3701 | 22 | 22 | Bronx | 1018 | 123192 | No | Yes | 2 | 0 | Male | 52 | Black | Non-US | Yes | Other | Some College | Employed | Private | 107920 | No | Above | Private Vehicle | 3.982366 | 10.64624 | 55.79213 |
| 3701 | 16 | 17 | Bronx | 2000 | 125000 | No | Yes | 1 | 0 | Female | 59 | White | US | Unknown | Spanish | Some College | Employed | Private | 125000 | No | Above | Public Transit | 3.982366 | 10.64624 | 55.79213 |
| 3701 | 4 | 6 | Bronx | 0 | 140588 | Yes | Yes | 6 | 3 | Male | 64 | Asian and Pacific Islander | Non-US | Yes | Chinese | Post-Graduate Degree | Not in labor force | Private | 11264 | No | Above | Other | 3.982366 | 10.64624 | 55.79213 |
| 3701 | 20 | 19 | Bronx | 1058 | 52876 | No | Yes | 2 | 1 | Male | 44 | White | Non-US | Yes | Spanish | Some College | Employed | Private | 32373 | No | Above | Public Transit | 3.982366 | 10.64624 | 55.79213 |
| 3701 | 9 | 9 | Bronx | 0 | 26101 | No | Yes | 1 | 0 | Female | 64 | White | US | Unknown | English | Post-Graduate Degree | Employed | Private | 26101 | No | Above | Private Vehicle | 3.982366 | 10.64624 | 55.79213 |
After cleaning variable names the 350,000+ interviews were summarized to the PUMA-level based on ‘perwt’ and ‘hhwt’. For each interview, the ‘perwt’ value describes the number of persons in the coded geographic area (PUMA) for which the interview is representative. For example, a ‘perwt’ of 34 indicates that there are 34 persons in the PUMA that share the same characteristics as described in the interview data (e.g., similar income, race, family size, etc.). The same is true of the ‘hhwt’ value for households in the coded PUMA. Therefore, the ‘perwt’ and ‘hhwt’ of each interview can be used to aggregate interview-level data to PUMA-level data as follows:
# Example data frame with weightings for summary stats over each PUMA
nyc_puma_summary = cleaned_data %>%
# Note: do we need to filter to one individual per household for household weightings?
group_by(puma) %>%
summarize(
total_people = sum(perwt),
median_household_income = weighted.median(household_income, hhwt, na.rm = TRUE),
perc_foodstamps = sum(hhwt[on_foodstamps == "Yes"]) * 100 / sum(hhwt),
perc_broadband = sum(hhwt[has_broadband == "Yes"]) * 100 / sum(hhwt),
perc_male = sum(perwt[sex == "Male"]) * 100 / sum(perwt),
median_age = weighted.median(age, perwt, na.rm = TRUE),
perc_white = sum(perwt[race == "White"]) * 100 / sum(perwt),
perc_foreign_born = sum(perwt[birthplace == "Non-US"]) * 100 / sum(perwt),
perc_citizen = sum(perwt[US_citizen == "Yes"]) * 100 / sum(perwt),
perc_english = sum(perwt[language == "English"]) * 100 / sum(perwt),
perc_college = sum(perwt[education %in% c("Some College", "Bachelor's Degree", "Post-Graduate Degree")]) * 100 / sum(perwt),
perc_unemployed = sum(perwt[employment == "Unemployed"]) * 100 / sum(perwt),
perc_insured = sum(perwt[health_insurance %in% c("Private", "Public")]) * 100 / sum(perwt),
median_personal_income = weighted.median(personal_income, perwt, na.rm = TRUE),
perc_welfare = sum(perwt[on_welfare == "Yes"]) * 100 / sum(perwt),
perc_poverty = sum(perwt[poverty_threshold == "Below"]) * 100 / sum(perwt),
perc_public_transit = sum(perwt[work_transport == "Public Transit"]) * 100 / sum(perwt),
covid_hosp_rate = median(puma_hosp_rate),
covid_vax_rate = median(puma_vacc_rate),
covid_death_rate = median(puma_death_rate)
)
# View first few rows of PUMA-summarized data frame
nyc_puma_summary %>% head() %>% knitr::kable()
| puma | total_people | median_household_income | perc_foodstamps | perc_broadband | perc_male | median_age | perc_white | perc_foreign_born | perc_citizen | perc_english | perc_college | perc_unemployed | perc_insured | median_personal_income | perc_welfare | perc_poverty | perc_public_transit | covid_hosp_rate | covid_vax_rate | covid_death_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3701 | 110016 | 69000 | 24.49306 | 89.50725 | 46.88500 | 39 | 46.53050 | 34.50407 | 20.26887 | 41.74120 | 47.96393 | 3.449498 | 93.91725 | 22357.09 | 20.21706 | 22.66943 | 24.79821 | 10.646245 | 55.79213 | 3.982366 |
| 3702 | 151067 | 68610 | 28.24215 | 82.93904 | 45.12435 | 36 | 14.06925 | 42.97100 | 27.86909 | 66.47183 | 39.83729 | 4.534412 | 92.58276 | 20859.00 | 22.08623 | 18.60036 | 21.55732 | 11.782181 | 47.19203 | 2.688261 |
| 3703 | 119529 | 77588 | 15.90074 | 86.60915 | 46.66734 | 43 | 41.20088 | 22.69324 | 16.91138 | 58.42432 | 45.13131 | 3.577374 | 94.11273 | 25030.00 | 16.73067 | 16.28140 | 19.85292 | 13.040106 | 58.33806 | 4.052719 |
| 3704 | 126664 | 64662 | 23.88688 | 81.56041 | 47.66311 | 37 | 32.72911 | 37.88685 | 23.03812 | 41.10165 | 40.30664 | 4.141666 | 92.33879 | 20362.00 | 20.17700 | 18.97698 | 23.55681 | 6.861973 | 29.38498 | 2.092428 |
| 3705 | 171016 | 36500 | 50.66455 | 87.59563 | 46.19509 | 30 | 15.45177 | 33.49043 | 16.46103 | 33.31033 | 28.35290 | 5.289563 | 91.41484 | 11264.00 | 29.20428 | 39.26943 | 23.03995 | 8.255382 | 33.17926 | 2.018467 |
| 3706 | 131986 | 43490 | 46.18132 | 82.15610 | 48.26724 | 32 | 19.37099 | 46.80496 | 20.58855 | 20.86812 | 29.15536 | 5.606655 | 89.18219 | 15000.00 | 25.09812 | 32.16402 | 28.97883 | 7.723878 | 32.95661 | 1.812561 |
This cleaned and aggregated data set has 55 rows, one for each PUMA, and is the basis of the exploratory analysis that follows.
Ultimately, we ended the data cleaning process with two data sets at different levels of aggregation: one at the census interview level, and another at the PUMA level. Most of our exploratory analysis focused on evaluating predictors and outcomes by PUMA, although the interview-level data was still helpful for us to determine outcome disparities across predictors city-wide.
We began by seeking a better understanding of how each of our three key outcomes – hospitalization rate, death rate, and vaccination rate – differ by PUMA. When PUMAs are ranked from lowest outcome rate to highest outcome rate, and colored by borough, we find the following: * Hospitalization rate ranges from nearly 4% to nearly 20%, with the plurality of high hospitalization rates occurring in Queens PUMAs and the plurality of low hospitalization rates occurring in Manhattan and Brooklyn PUMAs * Death rate ranges from <1% to over 6%, with the plurality of high death rates occurring in Queens PUMAs and the plurality of low death rates occurring in Manhattan and Brooklyn PUMAs * Vaccination rate ranges from nearly X% to over 100% (due to migrations between PUMAs), with all of the highest vaccination rates occurring in Manhattan and Queens, and all of the lowest vaccination rates occurring in Brooklyn and the Bronx
Although our regression modeling explores the relationship between particular variable pairings more fully to check for collinearity, we also were interested in observing how associated our key outcome variables were, and found that hospitalization and death rate were significantly correlated, whereas vaccination had limited correlation with both hospitalization (0.187) and death (0.075) at the PUMA level. Interestingly, the correlation between our key outcome variables was effectively modified by borough; for example, vaccination and hospitalization were significantly correlated in the Bronx (0.930), whereas hospitalization and death were not significantly correlated in Staten Island (0.526).
After exploring outcomes at the PUMA level, we were keen to dive more deeply into disparities by borough. Beyond simply confirming with boxplots the distribution of key outcomes across PUMAs in a given borough (e.g. hospitalization and death distributed towards higher rates in Queens and Bronx PUMAs, vaccination distributed towards higher rates in Queens and Manhattan), we were also curious to see what would happen if we took the median city-wide PUMA for each outcome, and binarily divided the PUMAs in each borough into those above the city-wide median and those below it. Although we’re working with only 55 PUMAs, which makes our borough percentages highly sensitive to any given PUMA given small sample size, we find – to take one example, that ~79% of PUMAs in Queens are above the city-wide median PUMA on death rate, but ~86% of PUMAs in Queens are above the city-wide median PUMA on vaccination rate.
| Borough | Total PUMAs | % Above Hosp Median | % Above Death Median | % Above Vax Median |
|---|---|---|---|---|
| Bronx | 10 | 70.0 | 50.0 | 20.0 |
| Brooklyn | 18 | 22.2 | 38.9 | 16.7 |
| Manhattan | 10 | 30.0 | 30.0 | 80.0 |
| Queens | 14 | 85.7 | 78.6 | 85.7 |
| Staten Island | 3 | 33.3 | 33.3 | 66.7 |
For each unique combination of age group, race, and sex, we wanted to determine which demographic category performed best and worst on each key outcome, and populated the following tables:
# Lowest hospitalization rates
race_age_sex %>%
filter(outcome == "hosp_rate") %>%
mutate(
outcome_rate = outcome_rate / 100
) %>%
arrange(outcome_rate) %>%
select(race, age_class, sex, outcome, outcome_rate) %>%
head() %>%
knitr::kable()
| race | age_class | sex | outcome | outcome_rate |
|---|---|---|---|---|
| White | 21-30 | Female | hosp_rate | 0.0876151 |
| White | 31-40 | Male | hosp_rate | 0.0885909 |
| White | 31-40 | Female | hosp_rate | 0.0890869 |
| White | 21-30 | Male | hosp_rate | 0.0896349 |
| White | 41-50 | Male | hosp_rate | 0.0927537 |
| White | 41-50 | Female | hosp_rate | 0.0934757 |
# Highest hospitalization rates
race_age_sex %>%
filter(outcome == "hosp_rate") %>%
mutate(
outcome_rate = outcome_rate / 100
) %>%
arrange(desc(outcome_rate)) %>%
select(race, age_class, sex, outcome, outcome_rate) %>%
head() %>%
knitr::kable()
| race | age_class | sex | outcome | outcome_rate |
|---|---|---|---|---|
| American Indian | 81-90 | Male | hosp_rate | 0.1272494 |
| Other | 61-70 | Male | hosp_rate | 0.1216952 |
| Other | 11-20 | Male | hosp_rate | 0.1211757 |
| Other | 41-50 | Male | hosp_rate | 0.1200270 |
| Other | 61-70 | Female | hosp_rate | 0.1185198 |
| Asian and Pacific Islander | 91-100 | Male | hosp_rate | 0.1173874 |
# Lowest death rates
race_age_sex %>%
filter(outcome == "death_rate") %>%
mutate(
outcome_rate = outcome_rate / 100
) %>%
arrange(outcome_rate) %>%
select(race, age_class, sex, outcome, outcome_rate) %>%
head() %>%
knitr::kable()
| race | age_class | sex | outcome | outcome_rate |
|---|---|---|---|---|
| White | 21-30 | Female | death_rate | 0.0237663 |
| White | 31-40 | Male | death_rate | 0.0242434 |
| White | 21-30 | Male | death_rate | 0.0243574 |
| White | 31-40 | Female | death_rate | 0.0245547 |
| Other | 81-90 | Male | death_rate | 0.0254102 |
| White | 41-50 | Male | death_rate | 0.0256468 |
# Highest death rates
race_age_sex %>%
filter(outcome == "death_rate") %>%
mutate(
outcome_rate = outcome_rate / 100
) %>%
arrange(desc(outcome_rate)) %>%
select(race, age_class, sex, outcome, outcome_rate) %>%
head() %>%
knitr::kable()
| race | age_class | sex | outcome | outcome_rate |
|---|---|---|---|---|
| 2+ races | 91-100 | Male | death_rate | 0.0352503 |
| Asian and Pacific Islander | 91-100 | Male | death_rate | 0.0338679 |
| American Indian | 81-90 | Male | death_rate | 0.0324616 |
| Other | 11-20 | Male | death_rate | 0.0322860 |
| Other | 41-50 | Male | death_rate | 0.0322078 |
| Other | 71-80 | Male | death_rate | 0.0320242 |
# Lowest vax rates
race_age_sex %>%
filter(outcome == "vax_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(outcome_rate) %>%
select(race, age_class, sex, outcome, outcome_rate) %>%
head() %>%
knitr::kable()
| race | age_class | sex | outcome | outcome_rate |
|---|---|---|---|---|
| American Indian | 71-80 | Male | vax_rate | 49.32433 |
| Black | 11-20 | Female | vax_rate | 49.35445 |
| Black | <11 | Male | vax_rate | 49.47578 |
| Black | 11-20 | Male | vax_rate | 49.48017 |
| Black | 31-40 | Female | vax_rate | 49.49496 |
| Black | <11 | Female | vax_rate | 49.51926 |
# Highest vax rates
race_age_sex %>%
filter(outcome == "vax_rate") %>%
mutate(
outcome_rate = outcome_rate
) %>%
arrange(desc(outcome_rate)) %>%
select(race, age_class, sex, outcome, outcome_rate) %>%
head() %>%
knitr::kable()
| race | age_class | sex | outcome | outcome_rate |
|---|---|---|---|---|
| Asian and Pacific Islander | 91-100 | Female | vax_rate | 66.13596 |
| Asian and Pacific Islander | 71-80 | Female | vax_rate | 65.89982 |
| Asian and Pacific Islander | 71-80 | Male | vax_rate | 65.73426 |
| Asian and Pacific Islander | 31-40 | Male | vax_rate | 65.61553 |
| Asian and Pacific Islander | 31-40 | Female | vax_rate | 65.57662 |
| 2+ races | 81-90 | Male | vax_rate | 65.47910 |
Overall, hospitalization and death rates were lowest (best) among white males and females under age 30, whereas vaccination rates were highest (best) among Asian and Pacific Islander males and females. Older American Indian individuals, along with younger and middle-aged Black individuals, tended to have the lowest vaccination rates, while mixed race, American Indian, and “other” racial groups tended to have higher hospitalization and death rates.
After exploring disparities in key outcomes across PUMAs and boroughs, we turn towards our large set of predictors, and begin by trying to determine which predictors to focus on using a correlation matrix (with outcomes). In the plot below, we include text only on those tiles that are highly significant, with p < 0.01.
Interestingly, we find: * The variables highly positively correlated with hospitalization rate at the PUMA level are percent US citizens and percent foreign born; the variables highly negatively correlated with hospitalization rate at the PUMA level are percent white, percent using public transit to get to work, percent with health insurance, percent English-speaking at home, percent with college education, percent with broadband access, and median personal and household incomes. * The variables highly positively correlated with death rate at the PUMA level, are percent US citizens and percent foreign born; the variables highly negatively correlated with death rate at the PUMA level are percent using public transit to get to work, percent with health insurance, percent college educated, and median personal and household incomes. * The variables highly positively correlated with vaccination rate at the PUMA level are percent white, percent male, percent college educated, percent with broadband access, median age, and median personal and household incomes; the variables highly negatively correlated with vaccination rate at the PUMA level are percent on welfare, percent unemployed, percent below the poverty threshold, and percent on food stamps
All in all, there are a couple of interesting trends observed in this correlation matrix. First, income seems strongly associated with all three outcome variables. In addition, the variables strongly associated with hospitalization rate also tend to be strongly associated with death rate, which is not much of a surprise given the high correlation already observed between hospitalization and death rate across PUMAs. Finally, we note that many signifiers of socioeconomic status – including poverty level, welfare, unemployment, and food stamp utilization – seem highly correlated with vaccination, but not with hospitalization and death. Given that vaccination is a more “active” outcome (requiring deliberate action) here than hospitalization or death, which are both the result of (in all likelihood) “passive” or indeliberate transmission, the association of socioeconomic factors with vaccination begin to indicate the potential impact from significant structural inequalities at play, hindering healthcare access among the poor.
We then selected each of the four variables with highest correlation (positive or negative) to each predictor, excluding obvious redundancies (like personal income and household income), and explored specific association trends between predictor and outcome, colored by borough.
Beyond the predictors significantly associated with each outcome, we wanted to focus as well on how outcomes varied by levels of key socioeconomic variables – namely, race, age group, and sex. Because we lack individual outcome data (i.e. each census observation within a given PUMA has the same PUMA-level hospitalization, death, and vaccination rate), we assumed for this analysis that all persons in a given PUMA had equal likelihood of a particular outcome (hospitalization, death, or vaccination) being true, with the likelihood corresponding to the PUMA outcome rate.
Some key findings include: * Males and females appear similar on each outcome * Hospitalization and death rates tend to be higher after middle age compared to those below middle age, whereas vaccination rate shows a general increase with each age group * In general, white individuals have a lower likelihood of hospitalization and death, but a higher likelihood – along with Asian and Pacific Islanders – of vaccination, whereas other groups of color and Native Americans seem to have higher hospitalization and death rates, but lower vaccination rates
Similarly, we examined how key outcomes varied across categories of a few seemingly important predictor variables for each outcome observed in our correlation matrix:
Again, there were a few interesting findings here, such as: * We observe a monotonic decrease in hospitalization and death rate as household income increases, but a monotonic increase in vaccination rate as household income increases as well * Individuals with public health insurance tend to perform similarly on key outcomes to those with no health insurance at all, with both groups worse than those who have private health insurance * Individuals with college and graduate education tend to perform better on key outcomes than those without any college education * Those with unknown citizenship status, which may signify being “undocumented,” tend to have lower death rates – perhaps indicative of underreporting due to fear of immigration control / policing, i.e. attention on the person or family * As already noted, those on welfare and foodstamps, as well as unemployed, are significantly less likely to be vaccinated against COVID-19
To build on the work above, we wanted to observe disparities within boroughs, knowing that different boroughs have different demographic compositions and that the ideal visualizations would allow us to understand how outcomes vary conditioned on borough demographic composition. Each of the following visualizations has three panels: number of people with outcome, categorized by borough and colored by predictor level; % of people with outcome in each borough, colored by predictor level, compared to the overall composition of the borough by predictor level; and percent with outcome variable, in each borough, plotted by level of predictor.
The most interesting finding here is that Manhattan seems to have the most significant racial disparities on hospitalization rate and death rate, along with age disparities on hospitalization rate, and racial and age disparities in vaccination rate. In general, the level of inequality on key outcomes across demographics in Manhattan tends to be higher than in other boroughs.
Another set of data visualizations that shows how different outcome rates vary within boroughs across levels of a key demographic predictor (age group, sex, race) is included below.
TO DO!